Faster Robust Tensor Power Method for Arbitrary Order
Tensor decomposition is a fundamental method used in various areas to deal
with high-dimensional data. \emph{Tensor power method} (TPM) is one of the
widely-used techniques in the decomposition of tensors. This paper presents a
novel tensor power method for decomposing arbitrary order tensors, which
overcomes limitations of existing approaches that are often restricted to
lower-order tensors or require strong assumptions about the underlying data
structure. We apply a sketching method and achieve an improved running time as a
function of the tensor order and dimension. We also provide a detailed analysis
for tensors of arbitrary order, which was not given in previous works.
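As a point of reference for the method being accelerated, the following is a minimal sketch of the classical (unsketched) tensor power iteration for a symmetric third-order tensor; the paper's contribution is a faster, sketched variant for arbitrary order, which is not reproduced here, and the problem sizes below are illustrative.

```python
import numpy as np

def tensor_power_iteration(T, num_iters=100, seed=0):
    """Classical power iteration for a symmetric 3rd-order tensor T (n x n x n).

    Repeatedly maps u <- T(I, u, u) / ||T(I, u, u)|| to approximate the top
    rank-1 component lambda * (u outer u outer u)."""
    n = T.shape[0]
    rng = np.random.default_rng(seed)
    u = rng.standard_normal(n)
    u /= np.linalg.norm(u)
    for _ in range(num_iters):
        # Contract T with u along the last two modes: (T(I,u,u))_a = sum_{b,c} T[a,b,c] u_b u_c
        v = np.einsum('abc,b,c->a', T, u, u)
        u = v / np.linalg.norm(v)
    lam = np.einsum('abc,a,b,c->', T, u, u, u)   # estimated eigenvalue
    return lam, u

# Build a symmetric rank-1 test tensor and recover its component.
n = 8
x = np.random.default_rng(1).standard_normal(n)
x /= np.linalg.norm(x)
T = 3.0 * np.einsum('a,b,c->abc', x, x, x)
lam, u = tensor_power_iteration(T)
print(lam, abs(u @ x))   # expect roughly 3.0 and 1.0
```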
Attention Scheme Inspired Softmax Regression
Large language models (LLMs) have brought transformative changes to human society.
One of the key computations in LLMs is the softmax unit. This operation is
important in LLMs because it allows the model to generate a distribution over
possible next words or phrases, given a sequence of input words. This
distribution is then used to select the most likely next word or phrase, based
on the probabilities assigned by the model. The softmax unit plays a crucial
role in training LLMs, as it allows the model to learn from the data by
adjusting the weights and biases of the neural network.
In the area of convex optimization, for example in central path methods for solving
linear programming, the softmax function has been used as a crucial tool for
controlling the progress and stability of the potential function [Cohen, Lee and
Song STOC 2019; Brand SODA 2020].
In this work, inspired by the softmax unit, we define a softmax regression
problem. Formally speaking, given a matrix $A$ and
a vector $b$, the goal is to use a greedy-type algorithm to
solve \begin{align*} \min_{x} \| \langle \exp(Ax), {\bf 1}_n \rangle^{-1}
\exp(Ax) - b \|_2^2. \end{align*} In a certain sense, our provable convergence
result provides theoretical support for why greedy algorithms can be used to
train the softmax function in practice.
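To make the objective concrete, here is a small numerical sketch of the softmax regression loss above together with plain gradient descent on it; the step size, problem sizes, and the use of vanilla gradient descent (rather than the paper's greedy algorithm) are illustrative assumptions.

```python
import numpy as np

def softmax(z):
    z = z - z.max()                      # shift for numerical stability (softmax is shift-invariant)
    u = np.exp(z)
    return u / u.sum()                   # <exp(z), 1>^{-1} exp(z)

def softmax_regression_loss(A, x, b):
    """L(x) = || <exp(Ax), 1_n>^{-1} exp(Ax) - b ||_2^2."""
    r = softmax(A @ x) - b
    return r @ r

def gradient(A, x, b):
    f = softmax(A @ x)
    J = np.diag(f) - np.outer(f, f)      # Jacobian of the softmax with respect to z = Ax
    return 2.0 * A.T @ (J @ (f - b))

rng = np.random.default_rng(0)
n, d = 20, 5
A = rng.standard_normal((n, d))
b = softmax(A @ rng.standard_normal(d))  # a realizable target vector
x = np.zeros(d)
for _ in range(2000):                    # plain gradient descent with a heuristic step size
    x -= 0.2 * gradient(A, x, b)
print(softmax_regression_loss(A, x, b))  # the loss should have decreased to a small value
```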
Convergence of Two-Layer Regression with Nonlinear Units
Large language models (LLMs), such as ChatGPT and GPT4, have shown
outstanding performance on many tasks in human life. Attention computation plays an
important role in training LLMs. The softmax unit and the ReLU unit are the key
structures in attention computation. Inspired by them, we put forward a softmax
ReLU regression problem. Generally speaking, our goal is to find an optimal
solution to the regression problem involving the ReLU unit. In this work, we
calculate a closed-form representation for the Hessian of the loss function.
Under certain assumptions, we prove the Lipschitz continuity and the positive
semidefiniteness of the Hessian. Then, we introduce a greedy algorithm based on an
approximate Newton method, which converges in the sense of the distance to the optimal
solution. Finally, we relax the Lipschitz condition and prove convergence in the sense
of the loss value.
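The following sketch illustrates one plausible reading of such an approximate Newton iteration on a ReLU regression loss, using a Gauss-Newton-style PSD surrogate for the Hessian plus a small regularizer $\lambda I$; the surrogate, the regularizer, and the random initialization are our own assumptions, not the paper's exact algorithm.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def loss(A, x, b):
    r = relu(A @ x) - b
    return r @ r                                   # L(x) = ||relu(Ax) - b||_2^2

def approx_newton_step(A, x, b, lam=1e-3):
    """One approximate Newton step using the PSD surrogate H = 2 A^T D A + lam*I,
    where D is the diagonal 0/1 activation pattern of the ReLU at Ax."""
    z = A @ x
    mask = (z > 0).astype(float)                   # (sub)gradient of the ReLU
    g = 2.0 * A.T @ (mask * (relu(z) - b))         # gradient of the loss
    H = 2.0 * (A.T * mask) @ A + lam * np.eye(A.shape[1])
    return x - np.linalg.solve(H, g)

rng = np.random.default_rng(0)
n, d = 50, 5
A = rng.standard_normal((n, d))
b = relu(A @ rng.standard_normal(d))               # a realizable target
x = rng.standard_normal(d)                         # random nonzero start so some units are active
for _ in range(20):
    x = approx_newton_step(A, x, b)
print(loss(A, x, b))                               # the loss should be small at the end
```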
Randomized and Deterministic Attention Sparsification Algorithms for Over-parameterized Feature Dimension
Large language models (LLMs) have shown their power in different areas.
Attention computation, as an important subroutine of LLMs, has also attracted
interest in theory. Recently, the static computation and dynamic maintenance of the
attention matrix have been studied by [Alman and Song 2023] and [Brand, Song and
Zhou 2023] from both the algorithmic and the hardness perspective. In this
work, we consider the sparsification of the attention problem. We make one
simplification, namely that the logit matrix is symmetric. Let $n$ denote the
length of the sentence and let $d$ denote the embedding dimension. Given a matrix $X \in \mathbb{R}^{n \times d}$ with $d \gg n$, suppose $\| X X^\top \|_{\infty} \leq r$; then we aim to find $Y \in \mathbb{R}^{n \times m}$ (where $m \ll d$) such that \begin{align*} \| D(Y)^{-1} \exp( Y Y^\top ) -
D(X)^{-1} \exp( X X^\top) \|_{\infty} \leq O(r) \end{align*} We provide two
results for this problem.
Our first result is a randomized algorithm with a stated success probability and
choice of $m$; its running time is expressed in terms of $\mathrm{nnz}(X)$, the number
of non-zero entries in $X$, and the exponent of matrix multiplication $\omega$
(currently $\omega \approx 2.37$).
Our second result is a deterministic algorithm, with its own running time bound and
choice of $m$; the bound is expressed column-wise, where $X_i$ denotes the $i$-th column
of the matrix $X$.
Our main findings have the following implication for applied LLM tasks: any
super-large feature dimension can be reduced down to a size nearly linear in the
length of the sentence.
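As an illustration of the quantity being controlled, the sketch below forms the symmetric-logit attention matrix $D(X)^{-1}\exp(XX^\top)$ and measures the entrywise error after replacing $X$ by a lower-dimensional $Y$; here $Y$ is produced by a plain Gaussian random projection as an assumed stand-in for the paper's randomized and deterministic constructions.

```python
import numpy as np

def attention_matrix(Z):
    """D(Z)^{-1} exp(Z Z^T) with D(Z) = diag(exp(Z Z^T) 1): row-normalized symmetric-logit attention."""
    E = np.exp(Z @ Z.T)
    return E / E.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
n, d, m = 32, 4096, 256                        # d >> n (over-parameterized features), m << d
X = rng.standard_normal((n, d)) / np.sqrt(d)   # scaled so the entries of X X^T stay O(1)

# Illustrative stand-in for the paper's construction of Y: a Gaussian random projection,
# which approximately preserves the Gram matrix X X^T (Johnson-Lindenstrauss style).
G = rng.standard_normal((d, m))
Y = X @ G / np.sqrt(m)

err = np.abs(attention_matrix(Y) - attention_matrix(X)).max()
print(err)   # entrywise (infinity-norm) gap between the two attention matrices
```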
Superiority of Softmax: Unveiling the Performance Edge Over Linear Attention
Large transformer models have achieved state-of-the-art results in numerous
natural language processing tasks. Among the pivotal components of the
transformer architecture, the attention mechanism plays a crucial role in
capturing token interactions within sequences through the utilization of the
softmax function.
Conversely, linear attention presents a more computationally efficient
alternative by approximating the softmax operation with linear complexity.
However, it exhibits substantial performance degradation when compared to the
traditional softmax attention mechanism.
In this paper, we bridge the gap in our theoretical understanding of the
reasons behind the practical performance gap between softmax and linear
attention. By conducting a comprehensive comparative analysis of these two
attention mechanisms, we shed light on the underlying reasons why softmax
attention outperforms linear attention in most scenarios.
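For readers unfamiliar with the two mechanisms, the sketch below contrasts softmax attention with one common instantiation of linear attention (a positive feature map applied to queries and keys); the specific feature map $\mathrm{elu}(z)+1$ is an assumption taken from the linear-attention literature, not from this paper.

```python
import numpy as np

def softmax_attention(Q, K, V):
    """Standard attention: row-wise softmax(Q K^T / sqrt(d)) V; cost grows as n^2 in sequence length n."""
    d = Q.shape[1]
    S = np.exp(Q @ K.T / np.sqrt(d))
    return (S / S.sum(axis=1, keepdims=True)) @ V

def linear_attention(Q, K, V, phi=lambda z: np.where(z > 0, z + 1.0, np.exp(z))):
    """Linear attention with a positive feature map phi (here elu(z)+1, one common choice).
    Computing phi(K)^T V first costs only O(n d^2), i.e. linear in the sequence length n."""
    Qp, Kp = phi(Q), phi(K)
    num = Qp @ (Kp.T @ V)                # goes through a d x d intermediate instead of n x n
    den = Qp @ Kp.sum(axis=0)            # row-wise normalizer
    return num / den[:, None]

rng = np.random.default_rng(0)
n, d = 128, 16
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
# The two outputs generally differ; quantifying this gap is what the paper analyzes.
print(np.abs(softmax_attention(Q, K, V) - linear_attention(Q, K, V)).max())
```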
Clustered Linear Contextual Bandits with Knapsacks
In this work, we study clustered contextual bandits where rewards and
resource consumption are the outcomes of cluster-specific linear models. The
arms are divided into clusters, with the cluster memberships being unknown to the
algorithm. Pulling an arm in a time period results in a reward and in the
consumption of each of multiple resources, and the total consumption
of any resource exceeding its constraint implies the termination of the
algorithm. Thus, maximizing the total reward requires learning not only models
about the reward and the resource consumption, but also cluster memberships. We
provide an algorithm that achieves regret sublinear in the number of time
periods, without requiring access to all of the arms. In particular, we show
that it suffices to perform clustering only once, on a randomly selected subset
of the arms. To achieve this result, we provide a sophisticated combination of
techniques from the econometrics literature and from the literature on bandits with constraints.
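The interaction protocol described above can be made concrete with a tiny simulation: arms belong to hidden clusters, rewards and per-resource consumptions follow cluster-specific linear models in the context, and the run terminates once any resource budget is exceeded. The uniform-random policy below is a placeholder rather than the paper's algorithm, and all model parameters are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
num_arms, num_clusters, dim, num_resources = 20, 3, 5, 2
budget = np.array([50.0, 50.0])                                # knapsack constraint per resource

cluster_of = rng.integers(num_clusters, size=num_arms)         # cluster memberships, unknown to the learner
theta_reward = rng.uniform(0, 1, size=(num_clusters, dim))     # cluster-specific linear reward models
theta_cost = rng.uniform(0, 1, size=(num_clusters, num_resources, dim))  # cluster-specific consumption models

consumed = np.zeros(num_resources)
total_reward, t = 0.0, 0
while True:
    contexts = rng.uniform(0, 1, size=(num_arms, dim))         # fresh per-arm contexts each period
    arm = rng.integers(num_arms)                                # placeholder policy: pull an arm uniformly at random
    c = cluster_of[arm]
    reward = contexts[arm] @ theta_reward[c]                    # reward from the arm's hidden cluster model
    consumption = theta_cost[c] @ contexts[arm]                 # one consumption value per resource
    if np.any(consumed + consumption > budget):                 # exceeding any resource terminates the run
        break
    consumed += consumption
    total_reward += reward
    t += 1
print(t, total_reward)
```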
Solving Tensor Low Cycle Rank Approximation
Large language models have become ubiquitous in modern life, finding
applications in various domains such as natural language processing, language
translation, and speech recognition. Recently, a breakthrough work [Zhao,
Panigrahi, Ge, and Arora Arxiv 2023] explains the attention model from
probabilistic context-free grammar (PCFG). One of the central computational tasks
for computing probabilities in a PCFG can be formulated as a particular tensor low-rank
approximation problem, which we call tensor cycle rank. Given a third-order tensor $A$, we say that $A$ has cycle rank $k$ if there
exist three matrices $U$, $V$ and $W$, each with $k^2$ columns, such that every
entry of $A$ satisfies \begin{align*} A_{a,b,c} = \sum_{i=1}^k \sum_{j=1}^k \sum_{l=1}^k
U_{a,i+k(j-1)} \otimes V_{b, j + k(l-1)} \otimes W_{c, l + k(i-1) }
\end{align*} The classical tensor rank, Tucker rank and tensor-train rank have been
well studied in [Song, Woodruff, Zhong SODA 2019]. In this paper, we generalize the previous
``rotation and sketch'' technique on page 186 of [Song, Woodruff, Zhong SODA
2019] and show an input-sparsity-time algorithm for cycle rank.
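To make the definition concrete, the sketch below constructs a tensor of cycle rank at most $k$ directly from the displayed formula (written 0-indexed), assuming a cubical tensor so that $U$, $V$, $W$ share the same number of rows.

```python
import numpy as np

def cycle_rank_tensor(U, V, W, k):
    """Build A_{a,b,c} = sum_{i,j,l in [k]} U[a, i+k(j-1)] * V[b, j+k(l-1)] * W[c, l+k(i-1)]
    (the loops below are 0-indexed), i.e. a tensor of cycle rank at most k."""
    n = U.shape[0]
    A = np.zeros((n, n, n))
    for i in range(k):
        for j in range(k):
            for l in range(k):
                A += np.einsum('a,b,c->abc',
                               U[:, i + k * j],
                               V[:, j + k * l],
                               W[:, l + k * i])
    return A

n, k = 6, 2
rng = np.random.default_rng(0)
U, V, W = (rng.standard_normal((n, k * k)) for _ in range(3))   # each factor has k^2 columns
A = cycle_rank_tensor(U, V, W, k)
print(A.shape)                                                  # (6, 6, 6)
```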
Unmasking Transformers: A Theoretical Approach to Data Recovery via Attention Weights
In the realm of deep learning, transformers have emerged as a dominant
architecture, particularly in natural language processing tasks. However, with
their widespread adoption, concerns regarding the security and privacy of the
data processed by these models have arisen. In this paper, we address a pivotal
question: Can the data fed into transformers be recovered using their attention
weights and outputs? We introduce a theoretical framework to tackle this
problem. Specifically, we present an algorithm that aims to recover the input
data from the given attention weights and outputs by minimizing a loss function
that captures the discrepancy between the expected output and the actual output of the
transformer. Our findings have significant implications for the Localized
Layer-wise Mechanism (LLM), suggesting potential vulnerabilities in the model's
design from a security and privacy perspective. This work underscores the
importance of understanding and safeguarding the internal workings of
transformers to ensure the confidentiality of processed data.
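A minimal sketch of the recovery idea, under our own simplifying assumptions: a single attention layer with known weights, and a squared-error discrepancy as a stand-in for the paper's loss function; the input is then recovered by off-the-shelf numerical minimization.

```python
import numpy as np
from scipy.optimize import minimize

def attention_output(X, Wq, Wk, Wv):
    """A single attention layer with fixed, known weight matrices."""
    d = X.shape[1]
    S = np.exp((X @ Wq) @ (X @ Wk).T / np.sqrt(d))
    return (S / S.sum(axis=1, keepdims=True)) @ (X @ Wv)

rng = np.random.default_rng(0)
n, d = 4, 3
Wq, Wk, Wv = (0.3 * rng.standard_normal((d, d)) for _ in range(3))
X_true = rng.standard_normal((n, d))
Y = attention_output(X_true, Wq, Wk, Wv)        # observed output of the layer

# Hypothetical squared-error discrepancy, used here as a stand-in for the paper's loss.
def recovery_loss(x_flat):
    X = x_flat.reshape(n, d)
    return np.sum((attention_output(X, Wq, Wk, Wv) - Y) ** 2)

res = minimize(recovery_loss, 0.1 * rng.standard_normal(n * d), method='L-BFGS-B')
# A near-zero objective means some input consistent with the weights and output was found
# (not necessarily X_true itself, since the map need not be injective).
print(res.fun)
```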